Add `merge` and `merge_n` algorithms #8753

pepijnve · 2025-10-31T09:22:57Z

Which issue does this PR close?

Closes #8752 (Provide algorithm that allows zipping arrays whose values are not prealigned).

Rationale for this change

The algorithms suggested in this PR originate from the case logic in DataFusion (see datafusion#18152 and datafusion#18444). I think it might be useful to move them to arrow-rs instead of being tucked away in a corner of the DataFusion codebase.

What changes are included in this PR?

Adds a two-way and an n-way merge algorithm that sit halfway between `zip` and `interleave`. In contrast to `zip`, the truthy and falsy arrays do not need to be prealigned. In contrast to `interleave`, the relative order of elements in each input array is retained in the final result.
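To make the difference with `zip` concrete, here is a minimal sketch of the two-way semantics on plain slices. It is not the arrow-rs implementation: `merge_sketch` and `zip_sketch` are hypothetical names, the mask is a plain `&[bool]` rather than a `BooleanArray`, and nulls are ignored entirely.

```rust
/// Sketch of the two-way merge semantics (illustrative only, not the arrow-rs API).
/// Each `true` in `mask` consumes the next unused element of `truthy`,
/// each `false` consumes the next unused element of `falsy`.
/// Returns `None` if an input runs out before the mask does.
fn merge_sketch<T: Clone>(mask: &[bool], truthy: &[T], falsy: &[T]) -> Option<Vec<T>> {
    let mut t = truthy.iter();
    let mut f = falsy.iter();
    // The output always has exactly `mask.len()` elements.
    let mut out = Vec::with_capacity(mask.len());
    for &m in mask {
        let next = if m { t.next() } else { f.next() };
        out.push(next?.clone());
    }
    Some(out)
}

/// Sketch of zip for comparison: both inputs must already be aligned with
/// the mask, so the value at position `i` is read from position `i`.
fn zip_sketch<T: Clone>(mask: &[bool], truthy: &[T], falsy: &[T]) -> Option<Vec<T>> {
    if truthy.len() != mask.len() || falsy.len() != mask.len() {
        return None;
    }
    let out = mask
        .iter()
        .enumerate()
        .map(|(i, &m)| if m { truthy[i].clone() } else { falsy[i].clone() })
        .collect();
    Some(out)
}
```

With the mask `[true, false, false, true, false]`, truthy values `["A", "B"]` and falsy values `["x", "y", "z"]`, the merge sketch yields `["A", "x", "y", "B", "z"]`, while the zip sketch rejects the same inputs because neither side has one element per mask position.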

Are these changes tested?

I've already added two minimal unit tests; more should probably be added.

Are there any user-facing changes?

No breaking API changes

pepijnve · 2025-10-31T09:27:23Z

The optimisation work that was done in #8653 would make sense here as well. That has not been done yet.

alamb

Thanks @pepijnve -- what do you think about also adding benchmarks to this kernel (so that future optimizations work better)

pepijnve · 2025-10-31T13:42:35Z

> what do you think about also adding benchmarks to this kernel

Good idea. I’m happy to continue working on this one. I created the PR already to get the ball rolling and solicit input from other devs.

pepijnve · 2025-11-02T09:36:34Z

> The optimisation work that was done in #8653 would make sense here as well. That has not been done yet.

While looking into this, I realised that `merge` on scalars is effectively identical to `zip`, so I resolved this by delegating to `zip` in case of scalar input.
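A rough way to see why the scalar case collapses into `zip`: a scalar behaves like an endless stream of the same value, so "take the next unused element" and "take the element at this position" cannot differ. The sketch below uses hypothetical helper names and plain Rust values instead of arrow `Datum`s, and assumes both inputs are scalars.

```rust
/// Merge semantics with scalar inputs (illustrative only): each scalar is
/// modelled as an infinite stream, and every mask entry pulls the next item
/// from the stream it selects.
fn merge_scalars_sketch<T: Clone>(mask: &[bool], truthy: &T, falsy: &T) -> Vec<T> {
    let mut t = std::iter::repeat(truthy);
    let mut f = std::iter::repeat(falsy);
    let mut out = Vec::with_capacity(mask.len());
    for &m in mask {
        // `repeat` never ends, so `next()` always returns `Some`.
        let next = if m { t.next() } else { f.next() };
        out.push(next.expect("repeat() is infinite").clone());
    }
    out
}

/// Zip semantics with scalar inputs: the value depends only on the mask entry.
fn zip_scalars_sketch<T: Clone>(mask: &[bool], truthy: &T, falsy: &T) -> Vec<T> {
    mask.iter()
        .map(|&m| if m { truthy.clone() } else { falsy.clone() })
        .collect()
}
```

Both functions produce the same output for every mask, which is the observation behind delegating to `zip`.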

pepijnve · 2025-11-02T09:38:45Z

> what do you think about also adding benchmarks to this kernel

@alamb I duplicated the microbenchmark for zip as a quick fix. Is it worth trying to actually share the sets of input data and masks? If so, where should I move that code?

martin-g · 2025-11-10T15:14:42Z

arrow-select/src/merge.rs

+///
+/// ```
+pub fn merge_n(values: &[&dyn Array], indices: &[impl MergeIndex]) -> Result<ArrayRef, ArrowError> {
+    let data_type = values[0].data_type();


There is no check for empty values array.

Check added along with unit tests
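For readers following along, the kind of guard being discussed can be sketched on plain slices. This is an illustration under assumptions, not the code in `arrow-select/src/merge.rs`: `merge_n_sketch` is a hypothetical name, indices are modelled as `Option<usize>` (with `None` assumed to produce a null), and errors are plain strings rather than `ArrowError`.

```rust
/// Sketch of an n-way merge with an explicit empty-input check (illustrative only).
/// `Some(i)` takes the next unused element of `values[i]`; `None` emits a null.
fn merge_n_sketch<T: Clone>(
    values: &[&[T]],
    indices: &[Option<usize>],
) -> Result<Vec<Option<T>>, String> {
    // Without this check, code that inspects `values[0]` would panic on empty input.
    if values.is_empty() {
        return Err("merge_n requires at least one input array".to_string());
    }
    // One read cursor per input preserves each input's relative order.
    let mut cursors = vec![0usize; values.len()];
    let mut out = Vec::with_capacity(indices.len());
    for idx in indices {
        match *idx {
            None => out.push(None),
            Some(i) => {
                let input = values
                    .get(i)
                    .ok_or_else(|| format!("index {} is out of range", i))?;
                let value = input
                    .get(cursors[i])
                    .ok_or_else(|| format!("input {} is exhausted", i))?;
                out.push(Some(value.clone()));
                cursors[i] += 1;
            }
        }
    }
    Ok(out)
}
```

Returning an error instead of indexing `values[0]` unconditionally turns the empty case into a recoverable failure; nulls also never have to be stored in any input, since a `None` index is enough to emit one.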

martin-g · 2025-11-10T15:15:45Z

arrow-select/src/merge.rs

+    let falsy = falsy_array.to_data();
+    let truthy = truthy_array.to_data();
+
+    let mut mutable = MutableArrayData::new(vec![&truthy, &falsy], false, truthy.len());


Suggested change

-    let mut mutable = MutableArrayData::new(vec![&truthy, &falsy], false, truthy.len());
+    let mut mutable = MutableArrayData::new(vec![&truthy, &falsy], false, mask.len());

martin-g · 2025-11-10T15:19:41Z

arrow-select/src/merge.rs

+/// Long spans of null values are also especially cheap because they do not need to be represented
+/// in an input array.
+///
+/// # Safety


Suggested change

-/// # Safety
+/// # Panics

martin-g · 2025-11-10T15:22:40Z

arrow-select/src/merge.rs

+use arrow_data::transform::MutableArrayData;
+use arrow_schema::ArrowError;
+
+/// An index for the [merge] function.


Suggested change

-/// An index for the [merge] function.
+/// An index for the [merge_n] function.

martin-g · 2025-11-10T15:24:44Z

arrow/benches/merge_kernels.rs

+        &mut group,
+        &masks,
+        &array_1_10pct_nulls,
+        &non_null_scalar_1,


The arguments here look exactly the same as for array_vs_non_null_scalar above. I think the last two arguments should be swapped.

Indeed. I had copied these from zip_kernel.rs which has the same mistake. Fixing here and in zip_kernel.rs.

pepijnve · 2025-11-10T20:18:46Z

There's a failing test case, but I can't say I see the relationship with this change.

Add merge and merge_n algorithms

eab6202

github-actions bot added the arrow label (Changes to the arrow crate) Oct 31, 2025

Add license header

462cd3e

pepijnve added 3 commits October 31, 2025 10:31

Formatting and clippy

eefc171

Remove unused import

dc7602a

Fix doc links

8068238

alamb reviewed Oct 31, 2025

pepijnve added 3 commits November 2, 2025 10:26

Delegate to zip when both truthy and falsy are scalar

fd3105c

Add merge to compute kernels list

66c8fa0

Duplicate zip benchmark for merge

4286c72

pepijnve added 4 commits November 2, 2025 10:39

Formatting

1d947df

Documentation link fixes

59a733a

Documentation link fixes

10af559

Documentation link fixes

ac68821

pepijnve mentioned this pull request Nov 3, 2025

Avoid scatter operation in ExpressionOrExpression case evaluation method apache/datafusion#18444

Merged

Update example diagram for merge to make difference with zip more obvious

347e3df

alamb mentioned this pull request Nov 4, 2025

Andrew Lamb Weekly-ish Open Source plan - 2025-11-03 apache/datafusion#18486

Open

47 tasks

Fix clippy warning

9bb40cc

martin-g reviewed Nov 10, 2025

pepijnve added 3 commits November 10, 2025 19:24

Correct argument order for non_null_scalar_vs_array benchmark

7d8a078

Use mask.len to avoid reallocations

641eac2

Correct doc comment

f4dcb6c

pepijnve force-pushed the merge branch from 2286110 to f4dcb6c Compare November 10, 2025 18:37

Add check for empty values array and tests

2b51143
